-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
failover: emit events when pd failover #1466
Conversation
@@ -76,11 +76,11 @@ func NewController( | |||
tikvFailoverPeriod time.Duration, | |||
tidbFailoverPeriod time.Duration, | |||
) *Controller { | |||
eventBroadcaster := record.NewBroadcaster() | |||
eventBroadcaster.StartLogging(glog.Infof) | |||
eventBroadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{QPS: 10}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default QPS
is too low: 1/ 300
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 events per object per seconds seems too large, it might generate a lot of load on the API server or unable to do other CRUD requests because of exceeding the total throttling limit. Most burstable tokens (25) are consumed by unnecessary events. We can increase the burst and QPS together but not too much at the first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, 10 is too large.
I will set them to proper values after removing those events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the current rate of events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rate of events
, does it mean the default value of QPS
and BurstSize
?
// by default, allow a source to send 25 events about an object
// but control the refill rate to 1 new event every 5 minutes
// this helps control the long-tail of events for things that are always
// unhealthy
defaultSpamBurst = 25
defaultSpamQPS = 1. / 300.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create an issue about setting the rate to proper values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an issue opened
eventBroadcaster := record.NewBroadcaster() | ||
eventBroadcaster.StartLogging(glog.Infof) | ||
eventBroadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{QPS: 10}) | ||
eventBroadcaster.StartLogging(glog.V(4).Infof) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glog.Info
is too noisy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
events are often the most important messages that end users care, I'd prefer to have it on in logs by default, how about v(2)
?
If the logging of the event is too noisy, we should reduce the rate of events we reported to the API server.
Another reason is the event can be dropped, in that case, we can find the full events from the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reasonable
eventBroadcaster.StartRecordingToSink(&eventv1.EventSinkImpl{ | ||
Interface: eventv1.New(kubeCli.CoreV1().RESTClient()).Events("")}) | ||
recorder := eventBroadcaster.NewRecorder(v1alpha1.Scheme, corev1.EventSource{Component: "tidbcluster"}) | ||
recorder := eventBroadcaster.NewRecorder(v1alpha1.Scheme, corev1.EventSource{Component: "tidb-controller-manager"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like k8s does
deddce8
to
304ed5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest LGTM
pf.recorder.Eventf(tc, apiv1.EventTypeWarning, "PDMemberUnhealthy", | ||
"%s(%s) is unhealthy", podName, pdMember.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, this is better to consider as a status
instead of event
, recording this will be too noisy (despite the flow control, the controller will emit an event in each round of control loop if there is an unhealthy PD member).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is already a .status.pd[].health
attribute.
Emit it as an event is good for kubectl describe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, seems like it is already in the describe result?
Members:
Yutengqiu - Demo - Pd - 0:
Client URL: http://yutengqiu-demo-pd-0.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
Health: true
Id: 12697782363740270066
Last Transition Time: 2020-01-06T01:48:58Z
Name: yutengqiu-demo-pd-0
Yutengqiu - Demo - Pd - 3:
Client URL: http://yutengqiu-demo-pd-3.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
Health: true
Id: 10833084519111696661
Last Transition Time: 2020-01-06T05:37:58Z
Name: yutengqiu-demo-pd-3
Yutengqiu - Demo - Pd - 4:
Client URL: http://yutengqiu-demo-pd-4.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
Health: true
Id: 10563190389194377650
Last Transition Time: 2020-01-06T05:44:25Z
Name: yutengqiu-demo-pd-4
Yutengqiu - Demo - Pd - 6:
Client URL: http://yutengqiu-demo-pd-6.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
Health: true
Id: 6735927804110166558
Last Transition Time: 2020-01-06T05:32:10Z
Name: yutengqiu-demo-pd-6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emit an event is more user-friendly, it is a progress of failover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or change PDMemberUnhealthy
to a more proper name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the progress of failover makes sense, it is okay for me to emit events based on the the status of PD when we cannot actually capture the the "PD turning from healthy to unhealthy" event.
The naming issue is just trivial, I think current name is ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to report the event on state change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion, we can do this in the syncTidbClusterStatus
method, not failover method, an issue opened: #1495
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…operator into pd-failover-event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/merge |
Your auto merge job has been accepted, waiting for 1486, 1484, 1493 |
/run-all-tests |
@weekface merge failed. |
/run-all-tests |
/merge |
Your auto merge job has been accepted, waiting for 1486 |
/run-all-tests |
cherry pick to release-1.1 in PR #1507 |
* emit event when pd failover * fix gofmt * fix gofmt * fix gofmt * address comment * address comment * fix CI issue
What problem does this PR solve?
fixes: #1465
This PR tries to emit three k8s
Events
:PDMemberUnhealthy
,PDMemberMarkedAsFailure
andPDMemberDeleted
to indicate the pdfailover
procedure when the use typekubectl describe tc
:But there is a problem that there are many not very useful
Events
:SuccessfulUpdate
, I will open another PR to remove them. @cofyc @aylei PTAL.What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
Does this PR introduce a user-facing change?: